The Twofish Encryption Algorithm: A 128-Bit Block Cipher The Twofish Encryption Algorithm: A 128-Bit Block Cipher
by Bruce Schneier ; John Kelsey ; Doug Whiting ; David Wagner ; Chris Hall ; Niels Ferguson
Wiley Computer Publishing, John Wiley & Sons, Inc.
ISBN: 0471353817   Pub Date: 03/01/99
  

Previous Table of Contents Next


It should also be noted that the Twofish round structure can be very nicely pipelined to break up the overall function into smaller and much faster blocks (e.g., q’s, key XORs, MDS, PHT, subkey addition, Feistel XOR). None of these operations individually is slow, but trying to run all of them in a single clock cycle does affect the cycle time. In ECB mode, counter mode, or an interleaved chaining mode, the throughput can be dramatically increased by pipelining the Twofish round structure. As a very simple example of two-level pipelining, we could compute the qi’s, key XORs and MDS multiply for one block during “even” clocks, while the PHT, subkey addition, and Feistel XOR would be computed on the “odd” clocks; a second block is processed in parallel on the alternate clock cycles. Using careful balancing of circuit delays between the two clock phases, this approach allows us to cut the logic delay in half, thus running the clock at twice the speed of an unpipelined approach. Such an approach does not require duplicating the entire Twofish round function logic, but merely the insertion of one extra layer of clocked storage elements (128 bits). Thus, for a very modest increase in gate count, throughput can be doubled, assuming that the application can use one of these cipher modes. It is clear that this general approach can be applied with more levels of pipelining to get higher throughput, although diminishing returns are achieved past a certain point. For even higher levels of performance, multiple independent engines can be used to achieve linear speedups at a linear ear cost in gates. We see no problem meeting NSA’s requirement to “be able to encrypt data at a minimum of 1 Gb/s, pipelined if necessary, in existing technology” [McD97].

Gate Count h Blocks Clocks
/Block
Interleave Levels Clock Speed Throughput (Mbits/sec) Startup Clocks
8000 0.25 324 1 80 MHz 32 20
14000 1 72 1 40 MHz 71 4
19000 1 32 1 40 MHz 160 40
23000 2 16 1 40 MHz 320 20
26000 2 32 2 80 MHz 640 20
28000 2 48 3 120 MHz 960 20
30000 2 64 4 150 MHz 1200 20
80000 2 16 1 80 MHz 640 300
Table 5.5. Hardware Tradeoffs (128-bit Key)

Table 5.5 gives hardware size and speed estimates for the case of 128-bit keys. The first line of the table is a “byte serial” implementation. It uses one clock per S-box lookup, and four clocks per h function (including the MDS). We allow two clocks for the PHT and key addition. With four h functions per round, each round requires 18 clock cycles.

The next line uses a fully wired h function, but still computes the round keys on the fly. The other versions all precompute the round subkeys. The last line is a version that precomputes the S-boxes in to dedicated RAM instead of computing the S-boxes on the fly. Depending on the architecture, the logic will grow somewhat in size for larger keys, and the clock speed (or startup time) may increase, but it is believed that a 128-bit AES scheme will be acceptable in the market long enough that most vendors will choose to implement that recommended key length.

These estimates are all based on existing 0.35 micron CMOS technology. All the examples in the table are actually quite small in today’s technology, except the final (highest performance non-pipeline) instance, but even that is very doable today and will become fairly inexpensive as the next generation silicon technology (0.25 micron) becomes the industry norm.


Previous Table of Contents Next